Enhanced Clustering for Forensic Analysis

Rahul D. Kopulwar, Prof. Chetan Bawankar

Computer Science and Engineering, Nagpur University, Nagpur, India

*Corresponding Author Email: rahul.kopul179@gmail.com; chetan251htc@gmail.com

ABSTRACT:

In today's digital world, forensic analysis is of great importance. Analysts must examine huge amounts of data to present evidence in court. However, the files on seized computers usually contain data in unstructured form, which is very difficult to analyze. To overcome this difficulty, automatic data analysis is of great interest. Automatic clustering algorithms can be used to retrieve interesting knowledge and useful information from data that is unstructured and unorganized. We propose an algorithm for automatic clustering of data that computer forensic experts can use in their analysis. We experiment with an enhanced K-medoid algorithm with representatives and compare it with well-known clustering algorithms. We performed experiments on data collected from real crime data sources, namely police investigation FIRs. We also propose enhanced preprocessing techniques, which can be beneficial over well-known stemmer algorithms. Finally, we summarize the results using visualization techniques.

KEYWORDS: Improved preprocessing, k-representative, dendrogram

INTRODUCTION:

Computer forensics is the branch of computer science in which forensic experts analyze huge amounts of crime-related data to present evidence in court. The computers found at crime scenes hold many clues. However, it is very difficult for forensic experts to analyze the data in the files found on such computers, because this data is usually unorganized and unstructured. Manual analysis of this data therefore takes a great deal of time, and its accuracy is often a concern. The type of the data is not known a priori, so it is hard to define classes or categories for it, and the classes may not have any labels. Machine learning and data mining algorithms are thus of great importance for automatic clustering that facilitates the analysis work. Even when labeling has been done in previous cases, the same labeling usually cannot be reused for the present case. Clustering algorithms are therefore needed that cluster data automatically so that useful information can be retrieved from it. Such organized information makes the experts' work much easier. Many clustering algorithms are available today. Partition-based clustering algorithms, for example k-means, have been found to be very useful, but their lack of automatic estimation of the number of clusters can be regarded as a limitation. Also, many forensic data analysis tools do not present results in summarized form. It is therefore both important and necessary to develop techniques that better estimate, automatically, the number of clusters present in data files. The presentation techniques also need to be improved so that results are presented more clearly. Another point to note in many clustering algorithms concerns the preprocessing steps used, namely the removal of stop words and stemming. The methods commonly used for preprocessing are not of sufficient quality for the subsequent work. In our study, we found that the Snowball algorithm gives comparatively good stemming results.

LITERATURE SURVEY:

The paper [1] proposed a method that focused on well-known algorithms (K-means, K-medoids, Single Link, Complete Link, Average Link, and CSPA). The datasets used in their experiments were collected from computers seized in police investigations. They instantiated 16 different clustering algorithms with different parameter combinations. Two relative validity indexes were used to estimate the number of clusters. Partition-based clustering algorithms such as K-means and K-medoids were found to be useful in many cases and to yield good results. They visualized the results with dendrograms. The silhouette was found to be very useful for validating clusters.
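The silhouette mentioned above can be illustrated with a small sketch. It is not the code used in [1]. It assumes the usual definition of the silhouette, s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance from object i to the other members of its own cluster and b(i) is the smallest mean distance from i to the members of any other cluster. The function names and the toy data are hypothetical.

```python
def silhouette_scores(dist, labels):
    """Return the silhouette value s(i) of every object.

    dist   -- symmetric matrix (list of lists) of pairwise distances
    labels -- labels[i] is the cluster id of object i
    """
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own or len(clusters) < 2:
            # Convention: a singleton cluster (or a single cluster) scores 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the object's own cluster
        a = sum(dist[i][j] for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return scores


def mean_silhouette(dist, labels):
    """Average silhouette over all objects; higher means a better partition."""
    scores = silhouette_scores(dist, labels)
    return sum(scores) / len(scores)


# Hypothetical toy data: four documents placed on a line.
points = [0, 1, 10, 11]
dist = [[abs(p - q) for q in points] for p in points]
good = mean_silhouette(dist, [0, 0, 1, 1])  # natural grouping
bad = mean_silhouette(dist, [0, 1, 0, 1])   # mixed grouping
print(good > bad)
```

To estimate the number of clusters, one would run the clustering algorithm for several candidate values of k and keep the partition with the highest mean silhouette.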

Fuzzy methods used by forensic analysts yield simple, understandable and reasonably accurate rules from the given data. Stylometric features were found useful in clustering anonymous e-mails. Stylometry is usually applied to written text, but it can also be applied to music and paintings [4]. The databases selected are categorized as temporal, spatial and typological. After various experiments on these databases, it was found that the number of clusters created coincides with the number of database attributes. This is a possible drawback, as it does not produce any meaningful membership function.

Another proposed approach is evidence accumulation clustering. In this approach, the results of multiple clustering algorithms are combined into a single data partition. Building the clustering ensemble, combining the evidence and extracting a consistent data partition were used to address these issues. Another approach to document clustering, proposed in [5], uses document frequency, term contribution, term variance quality, etc. Their approach is based on unsupervised clustering. Various forensic tools are available on the market, such as EnCase, ProDiscover and Forensic Toolkit. A comparison of these tools found that none of them offers complete functionality, because each tool was designed with a single functionality in mind. These functionalities include searching, hash verification and report generation [6].
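The evidence accumulation idea mentioned above is commonly realized through a co-association matrix: each entry records the fraction of base partitions in which two objects fall in the same cluster. The sketch below shows only this combining step, under that assumption. The function name and the three example partitions are hypothetical and are not taken from the surveyed work.

```python
def co_association(partitions):
    """Combine several partitions of the same objects into one matrix.

    partitions -- list of label lists, one per clustering run;
                  partitions[r][i] is the cluster of object i in run r
    Returns a matrix C where C[i][j] is the fraction of runs in which
    objects i and j were placed in the same cluster.
    """
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    total = len(partitions)
    return [[c / total for c in row] for row in counts]


# Hypothetical output of three clustering runs over four documents.
runs = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]
C = co_association(runs)
print(C[0][1], C[0][3])
```

A consistent final partition can then be extracted by applying, for instance, a hierarchical algorithm to the matrix, treating 1 - C[i][j] as a distance.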

PROPOSED METHODOLOGY:

In our proposed methodology, we initially concentrate on the preprocessing steps. These consist of the removal of stop words and the creation of a stemmer. The idea for the stemmer comes from the Snowball algorithm. A synonym technique can also help in finding the best possible document in the dataset. We collected the datasets from various real-world sources, such as FIRs from police investigations. We propose an improved K-medoid algorithm in which some representatives of the dataset are used in the clustering algorithm. Improved automatic estimation of the number of clusters, together with labels, still needs to be achieved. The clustering output then needs to be presented using a bar-chart-like plotting technique, so that computer forensic experts can analyze the data in summarized form. This paper presents our work up to the preprocessing stage; work on implementing the improved K-medoid clustering algorithm is continuing. The proposed architecture of our system is shown in Fig-1.
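For orientation, the baseline that the proposed method builds on can be sketched as follows. This is a plain K-medoids loop (alternating assignment and medoid update), not the enhanced algorithm with representatives, which this paper does not yet specify. The function name, the deterministic initialization and the toy data are assumptions made for illustration.

```python
def k_medoids(dist, k, init=None, max_iter=100):
    """Plain K-medoids on a precomputed distance matrix.

    dist -- symmetric matrix (list of lists) of pairwise distances
    k    -- number of clusters
    init -- optional list of k initial medoid indices
            (assumption: the first k objects are used when omitted)
    Returns (medoids, labels).
    """
    n = len(dist)
    medoids = list(init) if init is not None else list(range(k))

    def assign(current):
        # Each object joins the cluster of its nearest medoid.
        return [min(range(k), key=lambda c: dist[i][current[c]])
                for i in range(n)]

    for _ in range(max_iter):
        labels = assign(medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                new_medoids.append(medoids[c])  # keep an empty cluster's medoid
                continue
            # New medoid: the member with the smallest total distance
            # to all other members of the same cluster.
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members))
            )
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(medoids)


# Hypothetical toy data: six documents placed on a line.
points = [1, 2, 3, 10, 11, 12]
dist = [[abs(p - q) for q in points] for p in points]
medoids, labels = k_medoids(dist, k=2)
print(medoids, labels)
```

Because the medoids are always actual objects of the dataset, each cluster can be summarized to the expert by a real document rather than by an artificial centroid.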

Fig-1: Architecture for Enhanced Clustering

Dataset Collection:

We collected the datasets used in our experiments from various Internet sources, for example www.kiranbedi.com, www.lawctopus.com, etc. We collected various FIR-related documents from such sources.

Preprocessing:

We divide the preprocessing step into three parts: removal of stop words, formation of the stemmer, and maintenance of a synonym dictionary. The idea behind the synonym dictionary is as follows. Suppose we search for a word in order to find the data related to it, but that word is not present in the file or document. We can still obtain the information with the help of its synonyms.
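The synonym lookup described above might look like the following minimal sketch. The dictionary entries, file names and function names are hypothetical, and a real synonym dictionary would be far larger.

```python
import re

# Hypothetical synonym dictionary: search word -> related words.
SYNONYMS = {
    "theft": {"robbery", "stealing", "burglary"},
    "vehicle": {"car", "motorcycle", "truck"},
}


def expand_query(word):
    """Return the search word together with its known synonyms."""
    word = word.lower()
    return {word} | SYNONYMS.get(word, set())


def search(word, documents):
    """Return names of documents containing the word or any synonym."""
    terms = expand_query(word)
    hits = []
    for name, text in documents.items():
        tokens = set(re.findall(r"[a-z]+", text.lower()))
        if tokens & terms:
            hits.append(name)
    return sorted(hits)


# Hypothetical FIR snippets; neither contains the word "theft" itself.
documents = {
    "fir_001.txt": "Complainant reported a burglary at his shop.",
    "fir_002.txt": "The accused assaulted the complainant.",
}
print(search("theft", documents))
```

Here the first document is still found for the query "theft" because it contains the synonym "burglary".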

Our focus is to improve the stemmer that is usually used in preprocessing. We maintain a stem dictionary that holds almost every word commonly used in the English language. We set the stem as the root and attach all related words to it. If any related word is found in a document, we give the reference to its root word. We also apply a dictionary of words to remove the stop words. The result is shown in Fig-2.
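A minimal sketch of the dictionary-based preprocessing described above is given below. The stop-word list and the stem dictionary are tiny hypothetical samples, and the helper names are assumptions made for illustration.

```python
import re

# Hypothetical stop-word dictionary (a real one would be much larger).
STOP_WORDS = {"a", "an", "and", "the", "was", "is", "of", "in", "to"}

# Hypothetical stem dictionary: root word -> related word forms.
STEM_DICT = {
    "steal": ["steals", "stealing", "stole", "stolen"],
    "rob": ["robs", "robbed", "robbing", "robbery"],
}

# Reverse index so that each related word points back to its root.
WORD_TO_ROOT = {
    word: root for root, words in STEM_DICT.items() for word in words
}


def preprocess(text):
    """Lower-case and tokenize, drop stop words, map words to their roots."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [WORD_TO_ROOT.get(t, t) for t in tokens if t not in STOP_WORDS]


print(preprocess("The accused was stealing and robbed the shop"))
```

Words that do not appear in the stem dictionary are kept unchanged, so the quality of the result depends directly on how complete the dictionary is.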

We also give the absolute path at which the file is located in the system, which makes it easier to reach the file.

CONCLUSION:

We found that data clustering is a very difficult task. The accuracy of clustering depends on various factors, such as the preprocessing of data. The stemmer needs to work accurately, as it strongly affects the accuracy of the results at later stages. Automatic estimation of the number of clusters greatly reduces the complexity of clustering. The better the clustering technique used, the more it reduces the work of computer experts. Also, the results of a clustering algorithm are very accurate compared with the manual work of experts.

REFERENCES:

[1] L. F. da Cruz Nassif and E. R. Hruschka, “Document clustering for forensic analysis: An approach for improving computer inspection,” IEEE Transactions on Information Forensics and Security, vol. 8, no. 1, ISSN 1556-6013, 2013.

[2] B. S. Everitt, S. Landau, and M. Leese, Cluster Analysis. London, U.K.: Arnold, 2001.

[3] K. Stoffel, P. Cotofrei, and D. Han, “Fuzzy methods for forensic data analysis,” in Proc. IEEE Int. Conf. Soft Computing and Pattern Recognition, 2010, pp. 23–28.

[4] R. Hadjidj, M. Debbabi, H. Lounis, F. Iqbal, A. Szporer, and D. Benredjem, “Towards an integrated e-mail forensic analysis framework,” Digital Investigation, Elsevier, vol. 5, no. 3–4, pp. 124–137, 2009.

[5] B. K. L. Fei, J. H. P. Eloff, H. S. Venter, and M. S. Oliver, “Exploring forensic data with self-organizing maps,” in Proc. IFIP Int. Conf. Digital Forensics, 2005, pp. 113–123.

[6] F. Iqbal, H. Binsalleeh, B. C. M. Fung, and M. Debbabi, “Mining write prints from anonymous e-mails for forensic investigation,” Digital Investigation, Elsevier, vol. 7, no. 1–2, pp. 56–64, 2010.

[7] S. Decherchi, S. Tacconi, J. Redi, A. Leoncini, F. Sangiacomo, and R. Zunino, “Text clustering for digital forensics analysis,” Computat. Intell. Security Inf. Syst., vol. 63, pp. 29–36, 2009.

Received on 20.05.2015 Accepted on 19.06.2015

Int. J. Tech. 5(1): Jan.-June 2015; Page 01-03